Code-Switching, a common phenomenon in written text and conversation, has been studied over decades by the natural language processing (NLP) research community. Initially, code-switching is intensively explored by leveraging linguistic theories and, currently, more machine-learning oriented approaches to develop models. We introduce a comprehensive systematic survey on code-switching research in natural language processing to understand the progress of the past decades and conceptualize the challenges and tasks on the code-switching topic. Finally, we summarize the trends and findings and conclude with a discussion for future direction and open questions for further investigation.
translated by 谷歌翻译
We present NusaCrowd, a collaborative initiative to collect and unite existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have has brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and its local languages. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and its local languages. Our work is intended to help advance natural language processing research in under-represented languages.
translated by 谷歌翻译
The BLOOM model is a large open-source multilingual language model capable of zero-shot learning, but its pretraining was limited to 46 languages. To improve its zero-shot performance on unseen languages, it is desirable to adapt BLOOM, but previous works have only explored adapting small language models. In this work, we apply existing language adaptation strategies to BLOOM and benchmark its zero-shot prompting performance on eight new languages. We find language adaptation to be effective at improving zero-shot performance in new languages. Surprisingly, adapter-based finetuning is more effective than continued pretraining for large models. In addition, we discover that prompting performance is not significantly affected by language specifics, such as the writing system. It is primarily determined by the size of the language adaptation data. We also add new languages to BLOOMZ, which is a multitask finetuned version of BLOOM capable of following task instructions zero-shot. We find including a new language in the multitask fine-tuning mixture to be the most effective method to teach BLOOMZ a new language. We conclude that with sufficient training data language adaptation can generalize well to diverse languages. Our code is available at \url{https://github.com/bigscience-workshop/multilingual-modeling/}.
translated by 谷歌翻译
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
translated by 谷歌翻译
在阻止印尼自然语言处理(NLP)研究进步的基本问题的中心,我们发现数据稀缺。印尼语言,尤其是当地语言的资源极为稀缺和代表性不足。许多印尼研究人员没有发布其数据集。此外,我们拥有的少数公共数据集散布在不同的平台上,因此使印尼NLP的可重复性和以数据为中心的研究更加艰巨。面对这一挑战,我们开始了第一个印尼NLP众包努力,Nusacrowd。Nusacrowd努力为所有印尼语言中的NLP任务提供标准化数据加载,以提供最大的数据表聚合。通过使印尼NLP资源的开放式和集中式访问能力,我们希望Nusacrowd可以解决阻碍印度尼西亚NLP进展的数据稀缺问题,并将NLP从业者带来合作。
translated by 谷歌翻译
Several solutions for lightweight TTS have shown promising results. Still, they either rely on a hand-crafted design that reaches non-optimum size or use a neural architecture search but often suffer training costs. We present Nix-TTS, a lightweight TTS achieved via knowledge distillation to a high-quality yet large-sized, non-autoregressive, and end-to-end (vocoder-free) TTS teacher model. Specifically, we offer module-wise distillation, enabling flexible and independent distillation to the encoder and decoder module. The resulting Nix-TTS inherited the advantageous properties of being non-autoregressive and end-to-end from the teacher, yet significantly smaller in size, with only 5.23M parameters or up to 89.34% reduction of the teacher model; it also achieves over 3.04x and 8.36x inference speedup on Intel-i7 CPU and Raspberry Pi 3B respectively and still retains a fair voice naturalness and intelligibility compared to the teacher model. We provide pretrained models and audio samples of Nix-TTS.
translated by 谷歌翻译
我们从任务特定的BERT基教师模型执行知识蒸馏(KD)基准到各种学生模型:Bilstm,CNN,Bert-Tiny,Bert-Mini和Bert-small。我们的实验涉及在两个任务中分组的12个数据集:印度尼西亚语言中的文本分类和序列标记。我们还比较蒸馏的各个方面,包括使用Word Embeddings和未标记的数据增强的使用。我们的实验表明,尽管基于变压器的模型的普及程度不断上升,但是使用Bilstm和CNN学生模型,与修剪的BERT模型相比,使用Bilstm和CNN学生模型提供了性能和计算资源(CPU,RAM和存储)之间的最佳权衡。我们进一步提出了一些快速胜利,通过涉及涉及丢失功能,Word Embeddings和未标记的数据准备的简单选择的高效KD培训机制来生产小型NLP模型。
translated by 谷歌翻译
The rapid development of technology has brought unmanned aerial vehicles (UAVs) to become widely known in the current era. The market of UAVs is also predicted to continue growing with related technologies in the future. UAVs have been used in various sectors, including livestock, forestry, and agriculture. In agricultural applications, UAVs are highly capable of increasing the productivity of the farm and reducing farmers' workload. This paper discusses the application of UAVs in agriculture, particularly in spraying and crop monitoring. This study examines the urgency of UAV implementation in the agriculture sector. A short history of UAVs is provided in this paper to portray the development of UAVs from time to time. The classification of UAVs is also discussed to differentiate various types of UAVs. The application of UAVs in spraying and crop monitoring is based on the previous studies that have been done by many scientific groups and researchers who are working closely to propose solutions for agriculture-related issues. Furthermore, the limitations of UAV applications are also identified. The challenges in implementing agricultural UAVs in Indonesia are also presented.
translated by 谷歌翻译